import AuthorDetails from '@site/src/components/AuthorDetails';

## [HFClient TGI](https://github.com/huggingface/text-generation-inference)

### Prerequisites - Launching TGI Server locally

Refer to the [Text Generation-Inference Server API](/api/local_language_model_clients/TGI) for setting up the TGI server locally.

```bash
#Example TGI Server Launch

model=meta-llama/Llama-2-7b-hf # set to the specific Hugging Face model ID you wish to use.
num_shard=1 # set to the number of shards you wish to use.
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data -e HUGGING_FACE_HUB_TOKEN={your_token} ghcr.io/huggingface/text-generation-inference:latest --model-id $model --num-shard $num_shard
```

This command will start the server and make it accessible at `http://localhost:8080`.


### Setting up the TGI Client

The constructor initializes the `HFModel` base class to support the handling of prompting HuggingFace models. It configures the client for communicating with the hosted TGI server to generate requests. This requires the following parameters:

- `model` (_str_): ID of Hugging Face model connected to the TGI server.
- `port` (_int_ or _list_): Port for communicating to the TGI server. This can be a single port number (`8080`) or a list of TGI ports (`[8080, 8081, 8082]`) to route the requests to.
- `url` (_str_): Base URL of hosted TGI server. This will often be `"http://localhost"`.
- `http_request_kwargs` (_dict_): Dictionary of additional keyword agruments to pass to the HTTP request function to the TGI server. This is `None` by default. 
- `**kwargs`: Additional keyword arguments to configure the TGI client.

Example of the TGI constructor:

```python
class HFClientTGI(HFModel):
    def __init__(self, model, port, url="http://future-hgx-1", http_request_kwargs=None, **kwargs):
```

### Under the Hood

#### `_generate(self, prompt, **kwargs) -> dict`

**Parameters:**
- `prompt` (_str_): Prompt to send to model hosted on TGI server.
- `**kwargs`: Additional keyword arguments for completion request.

**Returns:**
- `dict`: dictionary with `prompt` and list of response `choices`.

Internally, the method handles the specifics of preparing the request prompt and corresponding payload to obtain the response. 

After generation, the method parses the JSON response received from the server and retrieves the output through `json_response["generated_text"]`. This is then stored in the `completions` list.

If the JSON response includes the additional `details` argument and correspondingly, the `best_of_sequences` within `details`, this indicates multiple sequences were generated. This is also usually the case when `best_of > 1` in the initialized kwargs. Each of these sequences is accessed through `x["generated_text"]` and added to the `completions` list.

Lastly, the method constructs the response dictionary with two keys: the original request `prompt` and `choices`, a list of dictionaries representing generated completions with the key `text` holding the response's generated text.


### Using the TGI Client

```python
tgi_llama2 = dspy.HFClientTGI(model="meta-llama/Llama-2-7b-hf", port=8080, url="http://localhost")
```

### Sending Requests via TGI Client

1) _**Recommended**_ Configure default LM using `dspy.configure`.

This allows you to define programs in DSPy and simply call modules on your input fields, having DSPy internally call the prompt on the configured LM.

```python
dspy.configure(lm=tgi_llama2)

#Example DSPy CoT QA program
qa = dspy.ChainOfThought('question -> answer')

response = qa(question="What is the capital of Paris?") #Prompted to tgi_llama2
print(response.answer)
```

2) Generate responses using the client directly.

```python
response = tgi_llama2._generate(prompt='What is the capital of Paris?')
print(response)
```

***

<AuthorDetails name="Arnav Singhvi"/>